Abstract


We describe a class of systems theory based neural networks called “Network Of Recurrent neural networks” (NOR), which introduces a new structure level to RNN-related models. In NOR, RNNs are viewed as high-level neurons and are used to build high-level layers. More specifically, we propose several methodologies to design different NOR topologies according to the theory of system evolution. We then carry out experiments on three different tasks to evaluate our implementations. Experimental results show that our models outperform the simple RNN remarkably under the same number of parameters, and sometimes achieve even better results than GRU and LSTM.


Introduction 


In recent years, Recurrent Neural Networks (RNNs) have been widely used in Natural Language Processing (NLP). Traditionally, RNNs are directly used to build the final models. In this paper, we propose a novel idea called “Network Of Recurrent neural networks” (NOR), which uses existing basic RNN layers as the building blocks for designing higher-level layers. From the standpoint of systems theory (Von Bertalanffy 1968; Von Bertalanffy 1972), a recurrent neural network is a group or an organization made up of a number of interacting parts, and it can be viewed as a complex system, or a complexity. Dialectically, every system is relative: it is not only the system of its parts, but also a part of a larger system. In NOR structures, an RNN is viewed as a high-level neuron, and several high-level neurons are used to build the high-level layers rather than being directly used to construct the whole model.

Conventionally, there are three levels of structure in Deep Neural Networks (DNNs): neurons, layers and whole nets (also called models). From the perspective of systems theory, at each level of such increasing complexity, novel features emerge that do not exist at lower levels (Lehn 2002). For example, at the neuron level, a single neuron is simple and its generalization capability is very poor. But when a certain number of such neurons are assembled into an elaborate structure by ingenious combinations, the layers at the higher level acquire an unprecedented ability of classification and feature learning. More importantly, such a newly gained capability or property is deducible from, but not reducible to, the constituent neurons of the lower level. It is not a property of the simple superposition of all constituent neurons: the whole is greater than the sum of the parts. In systems theory, this kind of phenomenon is known as whole emergence (Wierzbicki 2015). Whole emergence often comes from the evolution of the system (Arthur and others 1993), in which a system develops from a lower level to a higher level, from simplicity to complexity. In this paper, the motivation of NOR structures is to introduce a new structure level to RNN-related networks by transferring the traditional RNN from the system to the agent and from the outer dimension to the inner dimension (Fromm 2004).

In 1993, W. Brian Arthur identified three mechanisms by which complexity tends to grow as systems evolve:


• Mechanism 1: increase in co-evolutionary diversity. The agent in the system seems to be a new instance of an agent class, type or species. As a result, the system seems to have new external agent types or capabilities.

• Mechanism 2: increase in structural sophistication. The individual system steadily accumulates increasing numbers of new systems or parts. Thus, the newly formed system seems to have new internal subsystems or capabilities.

• Mechanism 3: increase by “capturing software”. The system captures simpler elements and learns to “program” these as “software” to be used for its own ends.


In this paper, guided by the first two mechanisms, we introduce two methodologies for NOR structure design, named aggregation and specialization. Aggregation and specialization are natural operations for increasing complexity in complex systems (Fromm 2004). The former is related to Arthur’s second mechanism, in which traditional RNNs are aggregated and accumulated into a high-level layer in accordance with a specific structure. The latter is related to Arthur’s first mechanism, in which the RNN agent in a high-level layer is specialized as an RNN agent that performs a specific function.

We make several implementations and carry out experiments on three different tasks: sentiment classification, question type classification and named entity recognition. Experimental results show that our models outperform their constituent simple RNN remarkably with the same number of parameters, and sometimes achieve even better results than GRU and LSTM.


Background 
Systems Theory Systems theory was originally proposed by the biologist Ludwig Von Bertalanffy (Von Bertalanffy 1968; Von Bertalanffy 1972) for biological phenomena. In biological systems, there are several levels that begin with the smallest units of life and reach to the largest and most extensive category: molecule, cell, tissue, organ, organ system, organization, etc. Traditionally, a system would be decomposed into its individual components so that each component could be analyzed as an independent entity, and components could be added in a linear fashion to describe the totality of the system (Walonick 1993). However, Von Bertalanffy argued that we cannot fully comprehend a phenomenon by simply breaking it down into elementary parts and then reforming it. We instead need to apply a global and systematic perspective to underline its functionality (... Pels, and Polese 2010), because a system is characterized by the interactions of its components and the nonlinearity of those interactions (Walonick 1993).


Whole Emergence In systems theory, the phenomenon that the whole is irreducible to its parts is known as emergence or whole emergence (Wierzbicki 2015). Emergence can be qualitatively described as: the whole is greater than the sum of the parts (Upton, Janeka, and Ferraro 2014). It can also be quantitatively expressed as:

$$W > \sum_{i=1}^{n} p_i \tag{1}$$

where $W$ is the whole of the system, which consists of $n$ parts, and $p_i$ ($i = 1 \dots n$) is the $i$-th part. In 1972, Philip W. Anderson highlighted the idea of emergence in his article “More is Different”, in which he stated that a change of scale very often causes a qualitative change in the behavior of the system. For example, in the human brain, examining a single neuron reveals nothing that suggests consciousness, yet a collection of millions of neurons is clearly able to produce consciousness.

The mechanisms behind the emergence of complexity can be used to design neural network structures. One widely accepted explanation of this emergence is the repeated application and combination of two complementary forces or operations: stretching and folding (in physics terms (Thompson and Stewart 2002)), splitting and merging (in computer science terms), or specialization and cooperation (in sociology terms). Merging or aggregating of agents generally means that a number of (sub-)agents are aggregated or conglomerated into a single agent. Splitting or specializing means that the agents are clearly separated from each other and each agent is constrained to a certain class or role (... 2004).


Recurrent Neural Networks at the Edge of Chaos Recurrent Neural Networks (RNNs) are a class of deep neural networks that possess internal short-term memory due to recurrent feedback connections between units, which enables them to process arbitrary sequences of inputs. Formally, given a sequence of vectors $\{x_t\}_{t=1 \dots T}$, the equation of the Simple RNN (Elman 1990) is:

$$h_t = f(W x_t + U h_{t-1}) \tag{2}$$

where $W$ and $U$ are parameter matrices, and $f$ denotes a nonlinear function such as tanh or ReLU. For simplicity, the neuron biases are omitted from the equation.
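
To make the recurrence in equation (2) concrete, the following is a minimal sketch in plain Python. The helper names (`matvec`, `rnn_step`, `run_rnn`) and the toy weights are illustrative assumptions, not part of the original work; biases are omitted, as in the equation.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def rnn_step(W, U, x_t, h_prev, f=math.tanh):
    """One step of equation (2): h_t = f(W x_t + U h_{t-1})."""
    pre = [a + b for a, b in zip(matvec(W, x_t), matvec(U, h_prev))]
    return [f(p) for p in pre]

def run_rnn(W, U, xs, h0, f=math.tanh):
    """Apply rnn_step over a whole sequence and collect every hidden state."""
    states, h = [], h0
    for x_t in xs:
        h = rnn_step(W, U, x_t, h, f)
        states.append(h)
    return states

# Toy example (hypothetical weights): 2-d input, 2 hidden units, ReLU.
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.5, 0.0], [0.0, 0.5]]
relu = lambda p: max(0.0, p)
states = run_rnn(W, U, [[1.0, -1.0], [2.0, 0.0]], [0.0, 0.0], f=relu)
print(states)  # [[1.0, 0.0], [2.5, 0.0]]
```

The hidden state produced at one step is fed back as `h_prev` at the next, which is the short-term memory referred to above.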

RNNs can in fact behave chaotically. Several works have analyzed RNNs theoretically or experimentally from the perspective of systems theory. One provided an exposition of research regarding system-theoretic aspects of RNNs with sigmoid activation functions. Another analyzed the computation at the edge of chaos in RNNs and calculated the critical boundary in parameter space where the transition from ordered to chaotic dynamics takes place. (Pascanu, Mikolov, and Bengio 2013) employed a dynamical systems perspective to understand the exploding and vanishing gradient problems in RNNs. In this paper, we draw methodologies from systems theory to guide the structure design of RNN-related models.


Network of Recurrent Neural Networks 
Overall Architecture 




Figure 1: Overview of NOR structure. 


As illustrated at a high level in Figure 1, the NOR architecture is a three-dimensional spatial-temporal structure. We summarize the NOR architecture as four components: I, M, S and O. Components I (input) and O (output) control the head and tail of a NOR layer, component S (subnetworks) is in charge of the spatial extension, and component M (memories) is responsible for the temporal extension of the whole structure. We describe each component as follows:

Component I: Component I controls the head of the NOR architecture. It performs data preprocessing and distributes the processed input data to the subnetworks. At each time-step $t$, the upcoming input data $x_t$ may take various forms, such as a single vector, several vectors carrying multi-granularity information, or even feature vectors with noise. A single vector is the simplest situation, and the common solution is to copy this vector into $n$ duplicates and feed each of them into one subnetwork in component S. In this paper, the copying method meets our needs and we formalize it as:

$$x_t^1, x_t^2, \cdots, x_t^n = C(x_t) \tag{3}$$

in which $C$ denotes the copy function, and $x_t^i$ is fed into the $i$-th subnetwork.

Figure 2: The sectional views of NOR layers at one time-step: (a) MA-NOR layer, (b) another MA-NOR layer, (c) MS-NOR layer, (d) SS-NOR layer. “I” means component I, “O” means component O, and “R” means RNN neuron.

Component M: Component M manages all memories over the whole layer, both internal and external (Weston, Chopra, and Bordes 2014). In this paper, however, component M only considers internal memory and does not apply any extra processing to the individual memory of each RNN neuron. That is:

$$m_t^j = I(o_{t-1}^j) \tag{4}$$

where $I$ denotes the identity function, the superscript $j$ is the identifier of the $j$-th RNN neuron [1], $m_t^j$ is the memory of the $j$-th RNN neuron at time-step $t$, and $o_{t-1}^j$ is the transformation output of the $j$-th RNN neuron at time-step $t-1$ [2].

Component S: Component S is made up of several subnetworks, which may be the same or different and may interact with each other. The responsibility of component S is to manage the logic of each subnetwork and handle the interaction between them. Suppose component S has in-degree $n$ and out-degree $m$, i.e., it receives $n$ inputs and produces $m$ outputs. The $k$-th output is generated from the necessary inputs and memories:

$$s_t^k = f([X, M]) \tag{5}$$

where $s_t^k$ is the $k$-th output at time-step $t$, $X$ and $M$ are the needed inputs and memories, and $f$ is a nonlinear function, which can be a one-layer RNN, a two-layer RNN, etc.

Component O: To form a layer we need a certain number of neurons, so one of the properties of NOR is that it contains multiple RNNs. A natural approach to integrating the signals of multiple RNN neurons is to collect all outputs first and then use an MLP layer to weight each output. A traditional neuron outputs a single real value, so the outputs can be collected by directly arranging them into a vector. RNN neurons are different, because each of them outputs a vector rather than a value. A simple method is to concatenate all vectors and then connect the concatenated vector to the next MLP. Another is to pool each RNN output vector into a real value and then arrange all these real values into a vector, which mirrors traditional neurons. In this paper, the former solution is used and formalized as:

$$s_t = [s_t^1; \cdots; s_t^n] \tag{6}$$
$$o_t = r(W_{MLP} \cdot s_t) \tag{7}$$

where $s_t$ is the concatenated vector, $W_{MLP}$ is the weight of the MLP and $r$ denotes the ReLU activation function of the MLP.

[1] The notation of the “subnetwork” is different from that of the “neuron”, for one subnetwork may be composed of several neurons. We use the superscript $i$ as the identifier of the subnetwork. So, the input data of the $i$-th subnetwork at time-step $t$ is denoted as $x_t^i$.

[2] In this paper, we just use the simple RNN with ReLU activation as our basic RNN neuron. Thus the memory at this time-step is just the output at the last time-step.


Methodology I: Aggregation 


Any operation that changes a boundary can cause an emergence of complexity. The natural boundary is the agent itself, and sudden emergence of complexity is possible at this boundary if complexity is transferred from the agent to the system or vice versa. There are two basic operations, aggregation and specialization, that can be used to transfer complexity between different dimensions.

According to Arthur’s second mechanism, internal complexity can be increased by aggregation and composition of sub-agents, which means that a number of RNN agents are conglomerated into a single big system. In this way, aggregation and composition transfer the traditional RNN from the outer to the inner dimension, from the system to the agent, because the selected RNNs are accumulated to become a part of a larger group.

For a concrete NOR layer, suppose it is composed of $n$ subnetworks, and the $i$-th subnetwork is made up of $k^i$ RNN neurons. Then, at time-step $t$, given the input $x_t$, the operation flow is as follows:

1. Component I: copy $x_t$ into $n$ duplicates using equation (3), which gives $x_t^1, x_t^2, \cdots, x_t^n$.

2. Component M: deliver the memory of each RNN neuron from the last time-step to the current time-step using equation (4), which gives memories $m_t^{1,1}, \cdots, m_t^{1,k^1}$ in the first subnetwork, memories $m_t^{2,1}, \cdots, m_t^{2,k^2}$ in the second subnetwork, etc.

3. Component S: for each subnetwork $i$, use the input $x_t^i$ and the memories $m_t^{i,1}, \cdots, m_t^{i,k^i}$ to get the nonlinear transformation output:

$$s_t^i = f(x_t^i, m_t^{i,1}, \cdots, m_t^{i,k^i}) \tag{8}$$

which gives $s_t^1, s_t^2, \cdots, s_t^n$.

4. Component O: concatenate all outputs by equation (6) and use an MLP function to determine how much of the signal in each subnetwork flows through component O by equation (7).
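
The four-step flow above can be sketched for the simplest case, in which every subnetwork is a single ReLU simple RNN. This is a minimal illustration in plain Python, not the released implementation: the function name `nor_layer_step`, the argument layout and the toy weights are assumptions made for the example.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def nor_layer_step(params, W_mlp, x_t, prev_outputs):
    """One time-step of a NOR layer whose subnetworks are one-tier RNNs.

    params       -- list of (W, U) pairs, one per RNN neuron (hypothetical layout)
    W_mlp        -- weight matrix of the MLP in component O
    prev_outputs -- output of every RNN neuron at the previous time-step
    """
    n = len(params)
    # Component I: copy the input once per subnetwork.
    xs = [list(x_t) for _ in range(n)]
    # Component M: the memory is the previous output, passed through unchanged.
    ms = [list(o) for o in prev_outputs]
    # Component S: each subnetwork applies a ReLU simple RNN transition.
    ss = []
    for (W, U), x, m in zip(params, xs, ms):
        pre = [a + b for a, b in zip(matvec(W, x), matvec(U, m))]
        ss.append(relu(pre))
    # Component O: concatenate all outputs, then apply a ReLU MLP.
    s_cat = [a for s in ss for a in s]
    o_t = relu(matvec(W_mlp, s_cat))
    return o_t, ss

# Toy example: two RNN neurons with one hidden unit each, 2-d input.
params = [([[1.0, 0.0]], [[1.0]]), ([[0.0, 1.0]], [[1.0]])]
W_mlp = [[1.0, 1.0]]
o1, mem = nor_layer_step(params, W_mlp, [1.0, 2.0], [[0.0], [0.0]])
o2, mem2 = nor_layer_step(params, W_mlp, [1.0, -3.0], mem)
print(o1, o2)  # [3.0] [2.0]
```

The per-neuron outputs returned at one step are passed back in as `prev_outputs` at the next step, which plays the role of component M.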


Clearly, the number, the type and the interaction of the aggregated RNNs determine the internal structure or inner complexity of the newly formed layer system. We therefore propose three kinds of topologies for the NOR aggregation method.


Multi-Agent In systems theory, the natural description of a complex system is the multi-agent system created by replication and adaptation. “Replication” means copying and reproducing a new RNN agent, and “adaptation” means that the copies are not entirely identical: variations in the weights or elsewhere increase the diversity of the system.

As shown in Figure 2(a), there is a NOR layer (called MA-NOR) composed of four parallel RNNs. Figure 3 shows this layer unrolled into a full network. Each subnetwork of the MA-NOR layer is a one-tier RNN, so at time-step $t$ the $i$-th subnetwork of component S in MA-NOR is calculated as:

$$s_t^i = o_t^i = r(W^i x_t^i + U^i m_t^i) \tag{9}$$

where $r$ denotes the ReLU activation function, $W^i$ and $U^i$ are the parameters of the corresponding RNN neuron, $o_t^i$ is the nonlinear transformation output, which is delivered to the next time-step to be used as $m_{t+1}^i$, and $s_t^i$ is the output of the $i$-th subnetwork, which is equal to $o_t^i$.

Figure 3: The unfolding of MA-NOR in three time-steps.


The nonlinear function in equation (5) of each subnetwork may be more complex. For example, Figure 2(b) shows a NOR layer made up of three two-tier RNNs. At time-step $t$, the $i$-th subnetwork in component S is calculated as:

$$o_t^{i,1} = r(W^{i,1} x_t^i + U^{i,1} m_t^{i,1}) \tag{10}$$
$$s_t^i = o_t^{i,2} = r(W^{i,2} o_t^{i,1} + U^{i,2} m_t^{i,2}) \tag{11}$$


Multi-Scale The combination of multiple RNNs in a Multi-Agent NOR layer makes it somewhat like an ensemble. Empirically, diversity among the members of a group of agents is deemed to be a key issue in ensemble structures (Kuncheva and Whitaker 2003). One way to increase the diversity is to use the Multi-Scale topology, which introduces new agent types to the system and can learn sequence dependencies at different timescales.

Figure 2(c) shows a NOR layer made up of four subnetworks, of which two are one-tier RNNs and the others are two-tier RNNs. Two kinds of timescale dependencies are learned in component S, which are formalized as follows:

$$s_t^1 = o_t^1 = r(W^1 x_t^1 + U^1 m_t^1) \tag{12}$$
$$s_t^2 = o_t^2 = r(W^2 x_t^2 + U^2 m_t^2) \tag{13}$$
$$o_t^{3,1} = r(W^{3,1} x_t^3 + U^{3,1} m_t^{3,1}) \tag{14}$$
$$s_t^3 = o_t^{3,2} = r(W^{3,2} o_t^{3,1} + U^{3,2} m_t^{3,2}) \tag{15}$$
$$o_t^{4,1} = r(W^{4,1} x_t^4 + U^{4,1} m_t^{4,1}) \tag{16}$$
$$s_t^4 = o_t^{4,2} = r(W^{4,2} o_t^{4,1} + U^{4,2} m_t^{4,2}) \tag{17}$$

Figure 4: The unfolding of MS-NOR in three time-steps.


Self-Similarity The aggregation and composition operations described above lead to big RNN agent-groups. In turn, these groups can also be combined to form an even bigger group. Such repeated aggregation and accumulation brings a fractal, self-similar structure into being.


Figure 5: The unfolding of SS-NOR in three time-steps. 


As shown in Figure 2(d), we again use three paths. But after each path first learns its own intermediate representation, the second layers gather the intermediate representations of all three paths to learn high-level abstract features. In this way, the different paths do not learn and train independently. The connections among them help the model share information easily. Thus, it becomes possible for the whole model to learn and train as an organic whole rather than as parallel independent structures. We formalize the cooperation in component S as follows:

$$o_t^{1,1} = r(W^{1,1} x_t^1 + U^{1,1} m_t^{1,1}) \tag{18}$$
$$o_t^{2,1} = r(W^{2,1} x_t^2 + U^{2,1} m_t^{2,1}) \tag{19}$$
$$o_t^{3,1} = r(W^{3,1} x_t^3 + U^{3,1} m_t^{3,1}) \tag{20}$$
$$s_t^1 = o_t^{1,2} = r(W^{1,2} [o_t^{1,1}, o_t^{2,1}, o_t^{3,1}] + U^{1,2} m_t^{1,2}) \tag{21}$$
$$s_t^2 = o_t^{2,2} = r(W^{2,2} [o_t^{1,1}, o_t^{2,1}, o_t^{3,1}] + U^{2,2} m_t^{2,2}) \tag{22}$$
$$s_t^3 = o_t^{3,2} = r(W^{3,2} [o_t^{1,1}, o_t^{2,1}, o_t^{3,1}] + U^{3,2} m_t^{3,2}) \tag{23}$$

Methodology II: Specialization 


We have mentioned that the emergence of complexity is usually connected to a transfer of complexity at the boundary of the system. Aggregation and composition transfer complexity from the system to the agent, and from the outer dimension to the inner dimension. Another way to cross the agent boundary is specialization or inheritance, which transfers complexity from the agent to the system, and from the inner dimension to the outer dimension (Fromm 2004). Specialization is related to Arthur’s first mechanism. It increases structural sophistication outside of the agent by adding new agent forms. Through inheritance and specialization, objects become objects of a certain class and agents become agents of a certain type, and the more an agent becomes a particular class or type, the more it needs to delegate special tasks that it cannot handle alone to other agents (Fromm 2004).

The effect of specialization is the emergence of delegation and division of labor in the newly formed groups. Thus, the formalization of the $k$-th output in component S can be rewritten as follows:

$$s_t^k = g(f_1([X, M]), f_2([X, M]), \cdots, f_L([X, M])) \tag{24}$$

where $f_l$ is the $l$-th specialized agent function, $g$ denotes the cooperation of all specialized agents, and $L$ is the number of specialized agents. Equation (24) states that the function $f$ in equation (5) is implemented by the separate operations $f_1, f_2, \cdots, f_L$ and $g$.


Gate-Specialization We regard the gate mechanism as one of the specialization methods. As shown in Figure 6, a general RNN agent is separated into two specialized RNN agents, one for gate duty and the other for generalization duty. A concrete Gate-NOR is shown in Figure 7. Each RNN agent of the original Multi-Agent NOR layer is specialized into one generalization-specific RNN and one gate-specific RNN. We formalize it as:

$$o_t^1 = \sigma(W^1 x_t^1 + U^1 m_t^1) \tag{25}$$
$$o_t^2 = r(W^2 x_t^2 + U^2 m_t^2) \tag{26}$$
$$s_t^1 = o_t^1 \odot o_t^2 \tag{27}$$
$$o_t^3 = \sigma(W^3 x_t^3 + U^3 m_t^3) \tag{28}$$
$$o_t^4 = r(W^4 x_t^4 + U^4 m_t^4) \tag{29}$$
$$s_t^2 = o_t^3 \odot o_t^4 \tag{30}$$

where $\sigma$ denotes the sigmoid activation and $\odot$ denotes element-wise multiplication.

Figure 6: Gate Specialization. (Labels: RNN Agents, Specialize, Generalization Specialized RNN, Gate Specialized RNN.)

Figure 7: The sectional views of Gate-NOR layer at one time-step.
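
A minimal sketch of one gate-specialized subnetwork follows: a sigmoid RNN produces the gate, a ReLU RNN produces the features, and the two are multiplied element-wise. The function name `gate_nor_subnetwork` and the toy weights are illustrative assumptions, not taken from the original implementation.

```python
import math

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def gate_nor_subnetwork(gate, gen, x_t, m_gate, m_gen):
    """One time-step of a subnetwork made of a gate RNN and a generalization RNN.

    gate, gen     -- (W, U) pairs of the two specialized RNN agents
    m_gate, m_gen -- their previous outputs (memories)
    """
    (Wg, Ug), (Wr, Ur) = gate, gen
    o_gate = sigmoid(add(matvec(Wg, x_t), matvec(Ug, m_gate)))  # gate duty
    o_gen = relu(add(matvec(Wr, x_t), matvec(Ur, m_gen)))       # generalization duty
    s = [a * b for a, b in zip(o_gate, o_gen)]                  # element-wise product
    return s, o_gate, o_gen

# Toy example: a gate with zero weights is half open (sigmoid(0) = 0.5).
gate = ([[0.0]], [[0.0]])
gen = ([[2.0]], [[1.0]])
s, o_gate, o_gen = gate_nor_subnetwork(gate, gen, [3.0], [0.0], [0.0])
print(s, o_gate, o_gen)  # [3.0] [0.5] [6.0]
```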


Relationship with LSTM and GRU We see Long Short-term Memory (LSTM) (Hochreiter and Schmidhuber 1997) and Gated Recurrent Unit (GRU) (Chung et al. 2014) as two special cases of Network of Recurrent neural networks. Take LSTM for example: at time-step $t$, given the input $x_t$, the previous memory cell $c_{t-1}$ and the hidden state $h_{t-1}$, the transition equations of the standard LSTM can be expressed as follows:

$$i = \sigma(W_i x_t + U_i h_{t-1}) \tag{31}$$
$$f = \sigma(W_f x_t + U_f h_{t-1}) \tag{32}$$
$$o = \sigma(W_o x_t + U_o h_{t-1}) \tag{33}$$
$$g = \tanh(W_g x_t + U_g h_{t-1}) \tag{34}$$
$$c_t = c_{t-1} \odot f + g \odot i \tag{35}$$
$$s_t = \tanh(c_t) \odot o \tag{36}$$

From the perspective of NOR, an LSTM is made up of four RNNs, three of which (the $i$, $f$ and $o$ RNNs) are specialized for gate tasks that control how much information is let through in different parts. Moreover, there is only one shared memory $h_{t-1}$, which can be accessed by every RNN cell in the LSTM.
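
To make the "four RNNs" reading explicit, the sketch below writes an LSTM step as four calls to the same simple RNN transition, differing only in their activation, followed by the gating in equations (35) and (36). The names `rnn` and `lstm_step` and the parameter dictionary are assumptions for illustration; biases are omitted, as in the equations.

```python
import math

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))

def rnn(W, U, x_t, h_prev, act):
    """A simple RNN transition act(W x_t + U h_{t-1})."""
    pre = [a + b for a, b in zip(matvec(W, x_t), matvec(U, h_prev))]
    return [act(p) for p in pre]

def lstm_step(params, x_t, h_prev, c_prev):
    """params maps each of 'i', 'f', 'o', 'g' to a (W, U) pair."""
    # Three gate-specialized RNNs and one generalization RNN,
    # all reading the same shared memory h_{t-1}.
    i = rnn(*params["i"], x_t, h_prev, sigmoid)
    f = rnn(*params["f"], x_t, h_prev, sigmoid)
    o = rnn(*params["o"], x_t, h_prev, sigmoid)
    g = rnn(*params["g"], x_t, h_prev, math.tanh)
    # Cooperation of the specialized agents, equations (35) and (36).
    c_t = [c * fv + gv * iv for c, fv, gv, iv in zip(c_prev, f, g, i)]
    s_t = [math.tanh(c) * ov for c, ov in zip(c_t, o)]
    return s_t, c_t

# Toy example with all-zero weights: every gate is 0.5 and g is 0.
zero = ([[0.0]], [[0.0]])
params = {k: zero for k in "ifog"}
s_t, c_t = lstm_step(params, [1.0], [0.0], [2.0])
print(c_t)  # [1.0]
```

With zero weights the forget gate halves the previous cell value, so `c_t` is `[1.0]` and `s_t` is half of `tanh(1.0)`.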

In turn, LSTM and GRU can also be combined to form an even bigger group.


Experiments 


In order to evaluate the performance of the presented model structures, we design experiments on the following tasks: sentiment classification, question type classification and named entity recognition. We compare all models under comparable parameter numbers to test how well each utilizes the parametric space. In order to verify the effectiveness and universality of the experiments, we conduct three comparative tests for each task at three different total parameter budgets (see Table 1). Every experiment is repeated 20 times with different random initializations and we report the mean results. It is worth noting that our aim here is to compare model performance under the same hyper-parameter settings, not to achieve the best performance for any single model.

Task | # of Params. | IRNN | GRU | LSTM | MA-NOR | MS-NOR | SS-NOR | Gate-NOR
Sentiment Classification | 200 k | 212 | 107 | 88 | 89 | 66 | 61 | 61
Sentiment Classification | 400 k | 320 | 166 | 139 | 136 | 100 | 90 | 97
Sentiment Classification | 800 k | 468 | 252 | 213 | 203 | 149 | 132 | 149
Question Classification | 100 k | 198 | 86 | 68 | 74 | 54 | 53 | 45
Question Classification | 200 k | 319 | 148 | 119 | 122 | 88 | 83 | 79
Question Classification | 400 k | 497 | 244 | 199 | 193 | 139 | 126 | 133
Named Entity Recognition | 200 k | 197 | 86 | 67 | 74 | 54 | 53 | 45
Named Entity Recognition | 400 k | 319 | 148 | 119 | 122 | 88 | 83 | 79
Named Entity Recognition | 800 k | 497 | 244 | 199 | 193 | 139 | 126 | 133

Table 1: Number of hidden neurons for IRNN, GRU, LSTM, MA-NOR, MS-NOR, SS-NOR and Gate-NOR for each network size specified in terms of the number of parameters (weights).
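
As a rough sanity check on how hidden sizes translate into parameter budgets, the sketch below counts the weights of a single recurrent layer for the three baselines. It assumes 300-dimensional input embeddings, one bias per hidden unit and per gate, and ignores the output layer; these assumptions and the helper name `recurrent_params` are illustrative rather than details confirmed by the text.

```python
def recurrent_params(input_dim, hidden, blocks=1):
    """Parameters of one recurrent layer.

    blocks is the number of (W, U, bias) groups:
    1 for a simple RNN, 3 for a GRU, 4 for an LSTM.
    """
    return blocks * (input_dim * hidden + hidden * hidden + hidden)

# Hidden sizes from the question classification row at the 100 k budget.
counts = {
    "IRNN": recurrent_params(300, 198, blocks=1),
    "GRU": recurrent_params(300, 86, blocks=3),
    "LSTM": recurrent_params(300, 68, blocks=4),
}
print(counts)  # {'IRNN': 98802, 'GRU': 99846, 'LSTM': 100368}
```

Under these assumptions all three baselines land close to 100 k, which is consistent with the budget stated for that row.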


(Le, Jaitly, and Hinton 2015) showed that when the recurrent weight matrix is initialized to the identity matrix and the biases to zero, a simple RNN with the ReLU activation function (named IRNN) can be comparable with or even outperform LSTM. In our experiments, all basic RNN neurons are simple RNNs with the ReLU function. We also keep the number of hidden units the same over all RNN neurons in a NOR model. Our natural baseline model is therefore a single giant simple RNN (Elman 1990) with ReLU activation. At the same time, two improved RNNs, GRU (Chung et al. 2014) and LSTM (Hochreiter and Schmidhuber 1997), have been widely and successfully used in NLP in recent years, so we also choose them as baseline models.


The pre-trained 300-D Glove 6B vectors [3] and 300-D Google News vectors [4] were obtained for the word embeddings. During training we fix all word embeddings and learn only the other parameters in all models. The embeddings for out-of-vocabulary words are set to zero vectors. We pad or crop the input sentences to a fixed length. Training is done through stochastic gradient descent over shuffled mini-batches with the Adam optimizer (Kingma and Ba 2014). All models are regularized with dropout (Srivastava et al. 2014). In addition, early stopping is applied to avoid overfitting and to prevent unnecessary computation during training. More details on the hyper-parameter settings can be found in our codes,
which are publicly available at t. 


[3] https://nlp.stanford.edu/projects/glove/
[4] https://code.google.com/archive/p/word2vec/


Sentiment Classification 


We evaluate our models on the task of sentiment classification on the popular Stanford Sentiment Treebank (SST) benchmark (Socher et al. 2013), which consists of 11855 movie reviews and is split into train (8544), dev (1101) and test (2210) sets. SST provides detailed phrase-level annotation, and all sentences along with the phrases are annotated with 5 labels: very positive, positive, neutral, negative, and very negative. In our experiments, we only use the sentence-level annotation. One reason is that we aim to avoid expensive phrase-level annotation, like (Qian, Huang, and Zhu 2016). Another is that, in practice, phrase-level annotation is hard to provide.

All models use the same architecture: embedding layer → dropout layer → RNN/NOR layer → RNN/NOR layer → max-pooling layer → dropout layer → softmax layer. The first layer is the word embedding layer, followed by two RNN/NOR layers that serve as the non-linear feature transformation. Then a max-pooling layer pools all transformed feature vectors by selecting the maximum value in each position to get the sentence representation. Finally, a softmax layer is used as the output layer to get the final result. To benefit from regularization, two dropout layers with a rate of 0.5 are added after the embedding layer and before the softmax layer. The initial learning rates of all models are set to 0.0002. We use the publicly available 300-D Glove 840B vectors to initialize word embeddings. Three different network sizes are tested for each architecture, such that the number of parameters is roughly 200 k, 400 k and 800 k (see Table 1). We set the minibatch size to 20. Finally, we use the cross-entropy criterion as the loss function.


Model | 200k Params | 400k Params | 800k Params
IRNN | 44.30 | 44.43 | 45.28
GRU | 47.27 | 47.35 | 47.65
LSTM | 46.98 | 47.19 | 47.37
MA-NOR | 45.17 | 45.24 | 45.37
MS-NOR | 44.83 | 45.51 | 45.59
SS-NOR | 44.54 | 45.21 | 45.42
Gate-NOR | 44.95 | 45.80 | 46.06

Table 2: Accuracy (%) comparison over different experiments on SST corpus.


The results of the experiments are shown in Table 2. The NOR models outperform the IRNN baseline at every network size, and all models improve as the network size grows. Among the NOR models, Gate-NOR gets the best results at 400 k and 800 k parameters, while MA-NOR is best at 200 k. However, LSTM and GRU get clearly better results than all NOR models in all three comparative tests.


Question Type Classification 


Question classification is an important step in a question answering system, which classifies a question into a specific type. For this task, we use the TREC benchmark, which divides all questions into 6 categories: location, human, entity, abbreviation, description and numeric. TREC provides 5452 labeled questions in the training set and 500 questions in the test set. We randomly select 10% of the training data as the validation set.


Model | 100k Params | 200k Params | 400k Params
IRNN | 92.73 | 93.22 | 93.62
GRU | 91.83 | 92.54 | 93.64
LSTM | 91.44 | 92.46 | 93.10
MA-NOR | 92.73 | 93.56 | 93.73
MS-NOR | 92.85 | 93.79 | 94.04
SS-NOR | 93.19 | 93.62 | 94.12
Gate-NOR | 92.77 | 93.35 | 93.81

Table 3: Accuracy (%) comparison over different experiments on TREC corpus.


All network types use the same architecture: embedding layer → dropout layer → RNN/NOR layer → max-pooling layer → dropout layer → softmax layer. Dropout rates are set to 0.5. Three hidden layer sizes are chosen such that the total number of parameters for the whole model is roughly 100 k, 200 k and 400 k (see Table 1). All networks use a learning rate of 0.0005 and are trained to minimize the cross-entropy error.

Table 3 shows the accuracy of the different networks on the question type classification task. Here again, the NOR models match or exceed the IRNN baseline at every network size (MA-NOR ties IRNN at 100 k). Among the NOR models, SS-NOR gets the best result at 100 k and 400 k parameters, and MS-NOR at 200 k. On this dataset, LSTM and GRU generally fall below IRNN (only GRU at 400 k is marginally higher), which is consistent with the results in (Le, Jaitly, and Hinton 2015).


Named Entity Recognition 


Named entity recognition (NER) is a classic NLP task that tries to identify the proper names of persons, organizations, locations, or other entities in a given text. We experiment on the CoNLL-2003 dataset (... 2003), which consists of 14987 sentences in the training set, 3466 sentences in the validation set and 3684 sentences in the test set.

Recently, popular NER models have been based on bidirectional LSTM (Bi-LSTM) combined with conditional random fields (CRF), named Bi-LSTM-CRF (Lample et al. 2016). Bi-LSTM-CRF networks can effectively use past and future features via a Bi-LSTM layer and sentence-level tag information via a CRF layer. In our experiments, we also adapt this architecture by replacing the LSTM with NORs or other RNN variants. So the common architecture of all tested models is: embedding layer → dropout layer → Bi-RNN/Bi-NOR layer → CRF layer. Three hidden layer sizes are chosen such that the total number of parameters for the whole network is roughly 200 k, 400 k and 800 k (see Table 1). We apply 50% dropout after the embedding layer. The initial learning rate is set to 0.005 and is reduced by a factor of 0.95 every epoch. The size of each minibatch is 20. We train all networks for 25 epochs and stop training early when there is no improvement on the validation set for 5 epochs.

Model | 200k Params | 400k Params | 800k Params
IRNN | 85.07 | 85.52 | 85.58
GRU | 83.97 | 84.13 | 85.05
LSTM | 84.95 | 85.70 | 86.18
MA-NOR | 85.96 | 86.17 | 86.18
MS-NOR | 85.51 | 86.10 | 86.21
SS-NOR | 86.06 | 86.20 | 86.33
Gate-NOR | 85.67 | 85.89 | 86.21

Table 4: F1 (%) comparison over different experiments on CoNLL-2003 corpus.
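
The learning-rate decay and early-stopping rule described for this task can be sketched as follows. This is an illustration only: the function name `train_schedule` is hypothetical, and the list `val_scores` stands in for validation F1 values that a real training loop would compute after each epoch.

```python
def train_schedule(val_scores, lr0=0.005, decay=0.95, max_epochs=25, patience=5):
    """Simulate per-epoch learning-rate decay with early stopping.

    Returns (epochs_run, best_epoch, learning_rates_used).
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_run = 0
    lrs = []
    lr = lr0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        epochs_run = epoch
        lrs.append(lr)              # learning rate used during this epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break                   # no improvement for `patience` epochs
        lr *= decay                 # reduce the rate for the next epoch
    return epochs_run, best_epoch, lrs

# Made-up validation scores: the best value occurs at epoch 3.
scores = [80.0, 82.0, 83.0, 82.5, 82.9, 82.8, 82.7, 82.6, 82.0]
epochs_run, best_epoch, lrs = train_schedule(scores)
print(epochs_run, best_epoch)  # 8 3
```

In this toy run, training stops after epoch 8 because five consecutive epochs failed to improve on the score reached at epoch 3.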

Our results are summarized in Table 4. As expected, all NORs perform better than the single giant RNN-ReLU model. GRU performs the worst at every network size. LSTM is slightly below IRNN at 200 k but overtakes it at 400 k and 800 k. At the same time, all NOR models match or exceed IRNN, GRU and LSTM at every network size (MA-NOR ties LSTM at 800 k). Among them, the SS-NOR model gets the best results.


Conclusion 


In conclusion, we introduced a novel kind of systems theory based neural networks called “Network Of Recurrent neural networks” (NOR), which views existing RNNs (for example, simple RNN, GRU, LSTM) as high-level neurons and then uses these RNN neurons to design higher-level layers. We then proposed several methodologies to design different NOR topologies according to the theory of system evolution (Arthur and others 1993). We conducted experiments on three kinds of tasks (sentiment classification, question type classification and named entity recognition) to evaluate our proposed models. Experimental results demonstrated that NOR models achieve superior performance compared with single giant RNN models, and that their performance sometimes even exceeds that of GRU and LSTM.